Toxic Comment Filter

BiLSTM model for multi-label classification
code
Deep Learning
Python, R
Author

Simone Brazzi

Published

August 12, 2024

1 Introduction

  • Build a model able to filter user comments according to how harmful their language is.
  • Preprocess the text, removing the set of tokens that carry no meaningful semantic contribution.
  • Transform the text corpus into sequences.
  • Build a Deep Learning model with recurrent layers for a multilabel classification task.

At prediction time, the model must return a vector containing a 1 or a 0 for each label in the dataset (toxic, severe_toxic, obscene, threat, insult, identity_hate). This way, a harmless comment is classified by a vector of all 0s [0,0,0,0,0,0], while a harmful comment has at least one 1 among the 6 labels.
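The label vector described above amounts to a thresholding step over per-label probabilities. A minimal sketch (the 0.5 threshold here is only illustrative; a tuned threshold is derived later in the post):

```python
import numpy as np

# Label order used throughout the post.
labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def to_label_vector(probabilities, threshold=0.5):
    """Map a vector of 6 per-label probabilities to a binary 0/1 vector."""
    return (np.asarray(probabilities) >= threshold).astype(int)

# A harmless comment yields all 0s; a harmful one has at least one 1.
clean = to_label_vector([0.1, 0.0, 0.2, 0.0, 0.1, 0.0])
toxic = to_label_vector([0.9, 0.1, 0.7, 0.0, 0.6, 0.0])
```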

2 Setup

Leveraging Quarto and RStudio, I will set up an R and Python environment.

2.1 Import R libraries

Import the R libraries. These are used both for rendering the document and for data analysis, since I prefer ggplot2 over matplotlib. I will also use colorblind-safe palettes.

Code
library(tidyverse, verbose = FALSE)
library(tidymodels, verbose = FALSE)
library(reticulate)
library(ggplot2)
library(plotly)
library(RColorBrewer)
library(bslib)
library(Metrics)

reticulate::use_virtualenv("r-tf")

2.2 Import Python packages

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import keras
import keras_nlp

from keras.backend import clear_session
from keras.models import Model, load_model
from keras.layers import TextVectorization, Input, Dense, Embedding, Dropout, GlobalAveragePooling1D, LSTM, Bidirectional, GlobalMaxPool1D, Flatten, Attention
from keras.metrics import Precision, Recall, AUC, SensitivityAtSpecificity, SpecificityAtSensitivity, F1Score


from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import multilabel_confusion_matrix, classification_report, ConfusionMatrixDisplay, precision_recall_curve, f1_score, recall_score, roc_auc_score

Create a Config class to store all the useful parameters for the model and for the project.

2.3 Class Config

I created a class holding the basic configuration of the model and the project, to improve readability.

Code
class Config():
    def __init__(self):
        self.url = "https://s3.eu-west-3.amazonaws.com/profession.ai/datasets/Filter_Toxic_Comments_dataset.csv"
        self.max_tokens = 20000
        self.output_sequence_length = 911 # check the analysis done to establish this value
        self.embedding_dim = 128
        self.batch_size = 32
        self.epochs = 100
        self.temp_split = 0.3
        self.test_split = 0.5
        self.random_state = 42
        self.total_samples = 159571 # total train samples
        self.train_samples = 111699
        self.val_samples = 23936
        self.features = 'comment_text'
        self.labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
        self.new_labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate', "clean"]
        self.label_mapping = {label: i for i, label in enumerate(self.labels)}
        self.new_label_mapping = {label: i for i, label in enumerate(self.new_labels)}
        self.path = "/Users/simonebrazzi/R/blog/posts/toxic_comment_filter/history/f1score/"
        self.model =  self.path + "model_f1.keras"
        self.checkpoint = self.path + "checkpoint.lstm_model_f1.keras"
        self.history = self.path + "lstm_model_f1.xlsx"
        
        self.metrics = [
            Precision(name='precision'),
            Recall(name='recall'),
            AUC(name='auc', multi_label=True, num_labels=len(self.labels)),
            F1Score(name="f1", average="macro")
            
        ]
    def get_early_stopping(self):
        early_stopping = keras.callbacks.EarlyStopping(
            monitor="val_f1", # "val_recall",
            min_delta=0.2,
            patience=10,
            verbose=0,
            mode="max",
            restore_best_weights=True,
            start_from_epoch=3
        )
        return early_stopping

    def get_model_checkpoint(self, filepath):
        model_checkpoint = keras.callbacks.ModelCheckpoint(
            filepath=filepath,
            monitor="val_f1", # "val_recall",
            verbose=0,
            save_best_only=True,
            save_weights_only=False,
            mode="max",
            save_freq="epoch"
        )
        return model_checkpoint

    def find_optimal_threshold_cv(self, ytrue, yproba, metric, thresholds=np.arange(.05, .35, .05), n_splits=7):
        # instantiate KFold
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
        threshold_scores = []

        for threshold in thresholds:
            cv_scores = []
            for train_index, val_index in kf.split(ytrue):
                ytrue_val = ytrue[val_index]
                yproba_val = yproba[val_index]

                # binarize the probabilities with the candidate threshold
                ypred_val = (yproba_val >= threshold).astype(int)
                score = metric(ytrue_val, ypred_val, average="macro")
                cv_scores.append(score)

            # average the metric across the folds for this threshold
            mean_score = np.mean(cv_scores)
            threshold_scores.append((threshold, mean_score))

        # find the threshold with the highest mean score
        best_threshold, best_score = max(threshold_scores, key=lambda x: x[1])
        return best_threshold, best_score

config = Config()
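As a usage illustration of find_optimal_threshold_cv, here is the same cross-validated search run standalone on synthetic data (the fake probabilities below are hypothetical; in the post, the model's predicted probabilities on the validation set are used):

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold

# Synthetic stand-in for (ytrue, yproba): probabilities correlated with labels.
rng = np.random.default_rng(42)
ytrue = rng.integers(0, 2, size=(200, 6))
yproba = np.clip(ytrue * 0.6 + rng.random((200, 6)) * 0.4, 0.0, 1.0)

def find_optimal_threshold_cv(ytrue, yproba, metric,
                              thresholds=np.arange(.05, .35, .05), n_splits=7):
    # same logic as Config.find_optimal_threshold_cv, as a free function
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    threshold_scores = []
    for threshold in thresholds:
        cv_scores = []
        for _, val_index in kf.split(ytrue):
            ypred_val = (yproba[val_index] >= threshold).astype(int)
            cv_scores.append(metric(ytrue[val_index], ypred_val, average="macro"))
        threshold_scores.append((threshold, np.mean(cv_scores)))
    # keep the threshold with the highest mean cross-validated score
    return max(threshold_scores, key=lambda x: x[1])

best_threshold, best_score = find_optimal_threshold_cv(ytrue, yproba, f1_score)
```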

3 Data

The dataset is fetched with tf.keras.utils.get_file from its URL. N.B. For reproducibility purposes, I also downloaded the dataset, since there were times when the link was not available.

Code
# df = pd.read_csv(config.path)
file = tf.keras.utils.get_file("Filter_Toxic_Comments_dataset.csv", config.url)
df = pd.read_csv(file)
Code
library(reticulate)

py$df %>%
  tibble() %>% 
  head(5)
Table 1: First 5 elements
# A tibble: 5 × 8
  comment_text            toxic severe_toxic obscene threat insult identity_hate
  <chr>                   <dbl>        <dbl>   <dbl>  <dbl>  <dbl>         <dbl>
1 "Explanation\nWhy the …     0            0       0      0      0             0
2 "D'aww! He matches thi…     0            0       0      0      0             0
3 "Hey man, I'm really n…     0            0       0      0      0             0
4 "\"\nMore\nI can't mak…     0            0       0      0      0             0
5 "You, sir, are my hero…     0            0       0      0      0             0
# ℹ 1 more variable: sum_injurious <dbl>

Let's create a clean variable for EDA purposes: I want to see visually how many observations are clean versus the other labels.

Code
df.loc[df.sum_injurious == 0, "clean"] = 1
df.loc[df.sum_injurious != 0, "clean"] = 0

3.1 EDA

First, a check on the dataset for possible missing values and imbalances.

3.1.1 Frequency

Code
library(reticulate)
df_r <- py$df
new_labels_r <- py$config$new_labels

df_r_grouped <- df_r %>% 
  select(all_of(new_labels_r)) %>%
  pivot_longer(
    cols = all_of(new_labels_r),
    names_to = "label",
    values_to = "value"
  ) %>% 
  group_by(label) %>%
  summarise(count = sum(value)) %>% 
  mutate(freq = round(count / sum(count), 4))

df_r_grouped
Table 2: Absolute and relative labels frequency
# A tibble: 7 × 3
  label          count   freq
  <chr>          <dbl>  <dbl>
1 clean         143346 0.803 
2 identity_hate   1405 0.0079
3 insult          7877 0.0441
4 obscene         8449 0.0473
5 severe_toxic    1595 0.0089
6 threat           478 0.0027
7 toxic          15294 0.0857

3.1.2 Barchart

Code
library(reticulate)
barchart <- df_r_grouped %>%
  ggplot(aes(x = reorder(label, count), y = count, fill = label)) +
  geom_col() +
  labs(
    x = "Labels",
    y = "Count"
  ) +
  # sort bars in descending order
  scale_x_discrete(limits = df_r_grouped$label[order(df_r_grouped$count, decreasing = TRUE)]) +
  scale_fill_brewer(type = "seq", palette = "RdYlBu") +
  theme_minimal()
ggplotly(barchart)
Figure 1: Imbalance in the dataset with clean variable

It is visible how imbalanced the dataset is. This suggests it could be useful to compute class weights and pass them as an argument during training.

It is clear that most of our texts are clean: 0.8033 of the observations are clean, and only 0.1967 carry at least one toxic label.
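The class-weight idea mentioned above can be sketched by weighting each label by its inverse positive frequency, using the counts from Table 2 (a hedged sketch only; the post does not prescribe this exact scheme):

```python
# Positive counts per label, taken from Table 2; n_samples is the dataset size.
n_samples = 159571
counts = {
    "toxic": 15294, "severe_toxic": 1595, "obscene": 8449,
    "threat": 478, "insult": 7877, "identity_hate": 1405,
}

# Inverse-frequency weighting: rarer labels receive larger weights.
weights = {label: n_samples / (2.0 * c) for label, c in counts.items()}
```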

3.2 Sequence length definition

To convert the text into a useful input for a NN, a TextVectorization layer is necessary. See Section 4.

One of its arguments is output_sequence_length: to choose it well, it is useful to analyze our text lengths. To simulate what the model does, we are going to remove the punctuation and the newlines from the comments.

3.2.1 Summary

Code
library(reticulate)
df_r %>% 
  mutate(
    comment_text_clean = comment_text %>%
      tolower() %>% 
      str_remove_all("[[:punct:]]") %>% 
      str_replace_all("\n", " "),
    text_length = comment_text_clean %>% str_count()
    ) %>% 
  pull(text_length) %>% 
  summary() %>% 
  as.list() %>% 
  as_tibble()
Table 3: Summary of text length
# A tibble: 1 × 6
   Min. `1st Qu.` Median  Mean `3rd Qu.`  Max.
  <dbl>     <dbl>  <dbl> <dbl>     <dbl> <dbl>
1     4        91    196  378.       419  5000

3.2.2 Boxplot

Code
library(reticulate)
boxplot <- df_r %>% 
  mutate(
    comment_text_clean = comment_text %>%
      tolower() %>% 
      str_remove_all("[[:punct:]]") %>% 
      str_replace_all("\n", " "),
    text_length = comment_text_clean %>% str_count()
    ) %>% 
  # pull(text_length) %>% 
  ggplot(aes(y = text_length)) +
  geom_boxplot() +
  theme_minimal()
ggplotly(boxplot)
Figure 2: Text length boxplot

3.2.3 Histogram

Code
library(reticulate)
df_ <- df_r %>% 
  mutate(
    comment_text_clean = comment_text %>%
      tolower() %>% 
      str_remove_all("[[:punct:]]") %>% 
      str_replace_all("\n", " "),
    text_length = comment_text_clean %>% str_count()
  )

Q1 <- quantile(df_$text_length, 0.25)
Q3 <- quantile(df_$text_length, 0.75)
IQR <- Q3 - Q1
upper_fence <- as.integer(Q3 + 1.5 * IQR)

histogram <- df_ %>% 
  ggplot(aes(x = text_length)) +
  geom_histogram(bins = 50) +
  geom_vline(aes(xintercept = upper_fence), color = "red", linetype = "dashed", linewidth = 1) +
  theme_minimal() +
  xlab("Text Length") +
  ylab("Frequency") +
  xlim(0, max(df_$text_length, upper_fence))
ggplotly(histogram)
Figure 3: Text length histogram with boxplot upper fence

Considering all the above analysis, I think a good starting value for output_sequence_length is 911, the upper fence of the boxplot. In the last plot, it is the dashed red vertical line. Doing so, we are removing the outliers, which are a small part of our dataset.
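The fence rule behind that value can be checked directly against the quartiles in Table 3 (Q1 = 91, Q3 = 419), and it matches config.output_sequence_length:

```python
# Boxplot upper-fence rule applied to the quartiles from Table 3.
q1, q3 = 91, 419
iqr = q3 - q1                 # interquartile range: 328
upper_fence = q3 + 1.5 * iqr  # 419 + 492 = 911
```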

3.3 Dataset

Now we can split the dataset in 3: train, test and validation sets. Since sklearn does not provide a function that splits into these 3 sets directly, we can do the following:

  • split between a train and a temporary set with a 0.3 split.
  • split the temporary set into equally sized test and validation sets.

Code
x = df[config.features].values
y = df[config.labels].values

xtrain, xtemp, ytrain, ytemp = train_test_split(
  x,
  y,
  test_size=config.temp_split, # .3
  random_state=config.random_state
  )
xtest, xval, ytest, yval = train_test_split(
  xtemp,
  ytemp,
  test_size=config.test_split, # .5
  random_state=config.random_state
  )

xtrain shape: py$xtrain.shape
ytrain shape: py$ytrain.shape
xtest shape: py$xtest.shape
ytest shape: py$ytest.shape
xval shape: py$xval.shape
yval shape: py$yval.shape
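These sizes can be sanity-checked against the values stored in Config (a hedged check, assuming sklearn's ceiling rounding of test_size):

```python
import math

total = 159571                       # config.total_samples
temp = math.ceil(total * 0.3)        # temporary set from test_size=0.3
train = total - temp                 # expected config.train_samples
val = temp - math.ceil(temp * 0.5)   # second split, test_size=0.5
test = temp - val

# expected number of training batches at batch_size=32
train_batches = math.ceil(train / 32)
```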

The datasets are created using the tf.data.Dataset API, which builds a data input pipeline. The tf.data API makes it possible to handle large amounts of data, read from different data formats, and perform complex transformations. A tf.data.Dataset is an abstraction that represents a sequence of elements, in which each element consists of one or more components. Here each dataset is created using from_tensor_slices, which builds a tf.data.Dataset from a tuple (features, labels). .batch lets us work in batches to improve performance, while .prefetch overlaps the preprocessing and model execution of a training step: while the model is executing training step s, the input pipeline is reading the data for step s+1. Check the documentation for further information.

Code
train_ds = (
    tf.data.Dataset
    .from_tensor_slices((xtrain, ytrain))
    .shuffle(xtrain.shape[0])
    .batch(config.batch_size)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

test_ds = (
    tf.data.Dataset
    .from_tensor_slices((xtest, ytest))
    .batch(config.batch_size)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

val_ds = (
    tf.data.Dataset
    .from_tensor_slices((xval, yval))
    .batch(config.batch_size)
    .prefetch(tf.data.experimental.AUTOTUNE)
)
Code
print(
  f"train_ds cardinality: {train_ds.cardinality()}\n",
  f"val_ds cardinality: {val_ds.cardinality()}\n",
  f"test_ds cardinality: {test_ds.cardinality()}\n"
  )
train_ds cardinality: 3491
 val_ds cardinality: 748
 test_ds cardinality: 748

Check the first element of the dataset to be sure that the preprocessing is done correctly.

Code
train_ds.as_numpy_iterator().next()
(array([b'I have AT&T;/Bonnaroo in the corner of my Pandora screen ...',
       b"in response to\n\nThere is a persistent meme in this article and others that models were introduced in response to some model from another company. That may be possible, but the lead time for a set of wheels and some pinstripes is 12 months, any real change is 2 years, and a fundamental change is more like 4 years. So this game of pingpong beloved of wiki editors really doesn't exist. Simple question, if pingpong worked, why did it take Toyota so long to bring the 4wd Kluger out?",
       b'"\n\nRandyKitty, none of these ""collaborators"" did anything. I welcome their collaboration! Let them add a single line, or even delete a line that they think is unimportant! Mass deleting is not a ""collaboration"" and you are abusing language to make your point. I am the only one who collaborated on here. Take a look at the record, and you shall see I collaborated with at least 4 other editors before 2 ""collaborators"" (who are collaborators in the sense that the French were collaborators with Hitler) simply mass deleted contributions, rather than selectively keeping what they liked or approved of. Please stop abusing language.   "',
       b'"\nAnd yet it does, and will continue to, until there is something else called ""PS4"" requiring a disambig page.  Get over it. \xe2\x80\x94\'\'\'\'\'\'T C W"',
       b'"\nAh, thanks for catching that! I guess I should have tried to read the whole thing before tagging it. {{nowrap|\' talkcontribs}} \nYes always worth reading the whole thing. Sometimes everything can seem totally reasonable till you see a final barbed sentence at the end, othertimes a person can seem utterly unnotable until you see a final sentence such as ""best known for winning silver in the 1920 Olympics"" SpielChequers \n\n Speedy deletion declined: Meghji Pethraj Shah \nHello RA0808. I am just letting you know that I declined the speedy deletion of Meghji Pethraj Shah, a page you tagged for speedy deletion, because of the following concern: famous is an assertion of importance.\'\'\'  Thank you.  SpielChequers\'\' "',
       b"Real Chance of Love 2 cast photo \n\ncan you please remove the deletion note. Because that's the actual cast of the upcoming show.",
       b'Oh, and I guess the few thousand words you put into Extreme Rules (2012) were good too. Whatever.',
       b'"== With regards to ""Very Long"" tag==\nHowever, there might be a lot to say about Senator Clinton!  The wikipedia Yom Kippur War article is long yet it was a very short event of days, compared to Clinton\'s nearly 60 years of life.  Vote:  Do not shorten for the sake of shortening. \n\nI think the ""Very Long"" tag may date back to before when material was split off into subarticles.  The article is still kind of long even after that, but manageable I think.  However imagine what will happen if she gets elected President!  All that has come before will be dwarfed by the new material to come. \n\nRegarding the ""CV section"" that you ( \n\nI was orginally very confused about her background because it takes work to figure out when she was practicing law and if it overlapped with being First Lady.  A short list of jobs is useful.  What\'s there to hide.  She was never in prison.  She worked for a prestigous law firm. \n\nHer practicing law did overlap with being First Lady of Arkansas, but not with being First Lady of the United States.    \n\nThe tag is meant for the article page, not the talk page. The article is currently around 67kb, which is long. Talk/Contribs \n\nReally being wikipedia NPOV: we need to treat people fairly, not have a double standard\nThe opening paragraph has been very slightly modified to match that of Christopher Dodd, another Democratic presidential contender.  We should treat them fairly and the same.  We still give Senator Clinton the honor of mentioning that she\'s first lady and has several first.\n\nI don\'t see how you can be against it unless you aren\'t fair.  If this isn\'t the case and you want to change it, explain why inconsistency is fair? \n\nWhy are you picking only Dodd?  Look at Barack Obama, John Edwards, John Kerry ... I\'ve looked at `1/3 to 1/2 of all the Senator articles, and all the 2008 presidential candidate articles, and there is absolutely no consistency among them in their intros.  
But let me ask you this, what is your objection to the current intro?    \n\nDodd is from CT, that borders NY.  First, we make consistency between Dodd and Clinton, then we\'ll worry about the others?  I can\'t see how you can be for inconsistency.  It just isn\'t fair!  And fairness and NPOV is what wikipedia is all about.  What\'s so bad with Dodd\'s intro?  It isn\'t negative or inaccurate. \n\nInconsistency of arrangement of neutral and unquestioned biographical details in an intro is not a matter of fairness or NPOV - it\'s just plain old inconsistency - so your argument is a red herring.  Your chance of getting all these articles to have consistent intros is zero, take my word for it.  Your revised intro to this article wasn\'t as good as the existing one, because it was a bit redundant and had faulty wikilinking.  Again, what is your agenda here?  What do you think is wrong with what we have?      \n\nFor one thing, Dodd is called a politician, which he is.  Hillary\'s occupation as politician is missing.  Someone is really a politician when they run for more than one office.  Jesse Ventura is difficult to call a politician cuz he just ran for one office then went back into wrestling or related jobs. \n\nYou are wrong about Ventura.  He was a mayor before he was governor and he was also involved in the shaping of third-party political organizations.  So he is a politician just as much as HRC or anyone else.  For the most part, however, I think it is redundant to say someone is a politician when you say they are a senator and running for president - obviously they are politicians.  I just sampled a bunch of articles, and maybe 1/3 use \'politician\' in the intro.  Being Wikipedia, it\'s ve',
       b'Interesting discussion. I was curious how this started with the people who deleted the tree, and found that independant views are not what they seem. I did a quick check at their talk pages -> [] and the first thing we see are buddy chats that indicate they certainly help each other achieve their editing/delete objectives, ie sonia,Ktr101,Off2riorob. Its unfortunate that sg editors are getting fewer and less active than previous years. Hopefully sg articles in wiki will not get dominated by foreigners.',
       b'"\nI\'m sorry, but that would be utterly retarded.  Clannad (the band), the name clearly explained multiple times in their history as an amalgamation of the Gaelic Clann As Dobhar, has been active since 1972 and has spearheaded a specific style of music on their own and additionally as an influence to many other bands.  ""Clannad"", the novel/game, was created in 2004 with NO explanation of why they chose that name for the Latin-alphabet spelling, other than a rumored summation that the creator may have liked Clannad (the band). \xe2\x82\xaa\xe2\x80\x94  (T\xc2\xb7C)  """',
       b"I fixed it for you.  You have to manually name the reference tag, because the software isn't smart enough to guess which one you're referring to.  (This is probably a feature, since you cite two different publications by Naylor, and odds are 50-50 that it would guess wrong if it were capable of guessing.)",
       b'"\n\n Myntra ad \n\nDid you added {{advert|date=February 2012}} tag to myntra? I have made some changes removing all the brand names (except Fastrack and being human as they were info) which seemed to advertise.\nPlease check it once and then remove the tag if it doesn\'t seem to advertise.Thanks!\xc2\xa0  \xe2\x9c\x89 let\xe2\x80\x99s talk about it \xc2\xa0:) "',
       b'Patrick J. Kennedy\nWhy did you delete my edit to the Patrick Kennedy article? My edit was completely factual. Homosexual Providence mayor David Cicilline will replace him in January 2011, and that is all that I wrote. I have undid your revision. Please do not erase my edits like this in the future. Thank you. 173.71.93.148',
       b"If making sure that an article is baffling and irrelevant to people who aren't Australian politics nerds is what turns you on, then carry on, I hope it brings you mindblowing pleasure.  190.44.133.67",
       b"And further more I am not the main contributor here. I just wanted to add one section. Its been re-done to become neutral. Conflict of interest? Because I'm writing about something I know? I'm sure anything you write about I would hope you have some interest in it before getting sources or not. What do you specialize in Orange Mike? Do you have hobbies? This is getting a little too ridiculous now. I expect the WP:Whining to be thrown my way next.-",
       b'"\n\n Hey donkey \n\nHey donkey, your Dad is coming.  \xe2\x80\x94\xc2\xa0Preceding unsigned comment added by 117.196.229.56   \n\n Enjoying the creepy messages? \n\nEnjoying all these creepy messages son?  \xe2\x80\x94\xc2\xa0Preceding unsigned comment added by 117.196.229.56   "',
       b'Zero is a valid code point, and the UTF-8 encoding of it is a NUL (`\\0`) byte. An 0xC0, 0x80 sequence in a UTF-8 is an invalid overlong encoding. In fact, the overlong encoding of NUL is used as an example of a security issue from an incorrect UTF-8 implementation in the RFC.',
       b'Also note that recent pictures of Zimmerman, eg (http://www.trbimg.com/img-1332543851/turbine/george-zimmerman-20120323), clearly do not show an obese man who weighs 250 lbs. 72.130.4.42',
       b'dude, the information is blatantly false, you can read the book for free, it says nothing of the sort.',
       b'Offer free pics of you if people donate 131.151.66.248',
       b'wow great=\nwow great.. you are a copyright superman.. next nobel prize for copyright is for you! cheers..',
       b"I told you I haven't edited any comment. Some comments in past have been edited by you. I can see that. And let other edit Kushwaha article so that it doesn't give false look it is having now. I hope this is this ok.",
       b'"\n\n I\'m strongly opposed to any furtherance of the pretence that collaborative editing requires going around getting blessings for one\'s actions off of people. There\'s nothing special about your having left a ""per above"" comment on that TfD which makes your being notified of prime importance to future work on it. Indeed, that it resulted in my being first called ""disingenuous"" and then casually asked to reconsider my dedication to the encyclopedia only confirms to me that asking the peanut gallery is not a productive use of my time. Chris Cunningham (not at work) - talk "',
       b'"\n\n Why has all the technical info been removed oover time? \n\nI was suprised to just see ""erotic"" vibrators in this article, and in selecting the history at randowm, I see that it used to contain, much more interesting uses of the term, (compators for constuction, joggers for paper, cell phone allert devices (including pictures) and the application I think of forvibrator, a switching device for a battery operated power supply, example battery operated radios...\n\nWhy the single mainded persuit of one fringe application?"',
       b'"\n\n ""Bradshaw\'s 6 yard go ahead touchdown run with 57 seconds left helped lead the New York Giants over the New England Patriots in Super Bowl XLVI"" \n\nTrue enough - but should the article perhaps point out that he was trying not to score at the time...   "',
       b'"\n\n Your civility \n\nI find your general tone very uncivil, rude and threatening. There is no need to be as demanding and generally confrontive as you are being. Please read Wikipedia:Civility - I think you\'d benefit from it. -\'(contribs) \n\nI still find your general tone quite confrontive, rude, threatening and generally uncivil. Perhaps you should also read m:Wikistress and consider a Wikibreak until you have calmed down considerably, as well as reading WP:CIVIL. -\'(contribs) \n\nThe fact is that I feel threatened and upset by your actions, as I find your general tone very rude and dismissive. The fact that someone can get away with being such a rude individual to many other editors (not just myself) is beyond me. I still seriously suggest that you read m:Wikistress, WP:BREAK and WP:CIVIL, especially the latter, and assess your own behaviour if you don\'t ""need"" anyone else to do it for you, as your general uncivility will only cause you problems in the long run - maybe reading those three pages will give you something to think about and you might see fit to correct your behaviour. -\'(contribs) \n\nI didn\'t specify that I felt physically threatened, but your irrational and extremely rude behaviour is intimidating. -\'(contribs) "',
       b'"\n\nEhem, ""correct"" spelling? May I remind you that English was invented in England and American spelling only came about because someone was too lazy to learn so wrote his own dictionary. I know kids tried to do that to get around school classes but for an whole nation! Sorry, but it is not correct, it is a different language! p\nSorry, had to make that rant (that\'s the short version!), and it is UK-IE spelling to be exact as all European articles follow the English used in Europe, not America. ""The Union"" and ""EU"" is nothing to do with spelling, it is just has better flow in my opinion. ""EU"" isn\'t a very good acronym. -  t: \nIt was meant to be inflammatory, but I figured I could get away with a little good-natured ribbing. ) You\'re right it\'s not so much spelling, but probably more accurate to compare it to using odd words like ""lift"", ""petrol"", ""bonnet"", and ""boot"" for the more proper ""elevator"", ""gas"", ""hood"", and ""trunk"". I\'ve never heard anyone on this side of the pond call it simply ""the Union"", but that\'s no doubt because we don\'t talk about it that much at all. Regardless, it would be quite clear from context that ""the Union"" == ""the European Union"".  (talk|contribs) "',
       b"Stooges are the nickname for the fans on Murphy's site... and if you want to feel all self pitying we can end this conversation now and go down the path.   Here is what you are not getting- Shitapedia is NOT the FBI or my University.  If you want to have an encylopedic entry on me then that entry HAS TO be accurate.  Simply check the laws- you have HUGE leeway to defame celebrities and public figure and you do so freely.  BUT that stops with a non public figure.  You can try to have your cult argue that I am a public figure, but when 100 out of 100 randomly selected people can't identify me, then that game is over.  As a non public figure you do NOT have much leeway as regards defaming me.  I don't need a specialist- if you are going to keep a file on me it is going to be a file that I sign off on- period.  Unfortunately for your cultists, I have some resources to fight you on that, even if it is just sending people to vandalize.  I am also trying to work in mocking references to the site in major films.  Anything to undermine you.  You want me to play with you.  This is my name, my career, my rights- you don't get to tell me how to play.  Capisce?",
       b'In the discussion it is also noted that there is a request for mediation active which had not been responded to yet, and surely this has to be considered before the discussion is closed?',
       b'"\n\n I appreciate the support.  I have added this item to WP:BLPN per the instructions found in Wikipedia:BLP#Remove_unsourced_or_poorly_sourced_contentious_material, rather than attempting to handle the problem myself.  I note that this same reference is being used to include content above as well, and so will apply to that content as well.\n\nClearly it will be impossible for me to prove a negative, i.e. that no such verification exists, nor should I even have to attempt to do so since, per per Wikipedia:Verifiability#Burden_of_evidence that ""The burden of evidence lies with the editor who adds or restores material.""\n\nIt is also clear by simple googling that this particular reference is becoming detrimental to the reputation of Wikipedia itself as I can find numerous sources that refer to this exact page and this exact topic.    "',
       b'BLM Platinum Continues to Paste Material onto My Discussion Page \n\nHi. It appears BLM Platinum did not fully understand your comment and has continued to post things on my discussion.71.48.135.174',
       b'"Welcome\n\nHello and welcome to Wikipedia!  We appreciate encyclopedic contributions, but some of your recent edits do not conform to our policies.  For more information on this, see Wikipedia\'s policies on vandalism and limits on acceptable additions. If you\'d like to experiment with the wiki\'s syntax, please do so on Wikipedia:Sandbox rather than in articles.\n\nIf you still have questions, there is a new contributor\'s help page, or you can write {{helpme}} below this message along with a question and someone will be along to answer it shortly.  You may also find the following pages useful for a general introduction to Wikipedia.\nThe five pillars of Wikipedia\nHow to edit a page\nHelp pages\nTutorial\nHow to write a great article\nManual of Style\nPolicy on neutral point of view\nGuideline on external links\nGuideline on conflict of interest\nI hope you enjoy editing Wikipedia!  Please sign your name on talk pages using four tildes (~~~~); this will automatically produce your name and the date. Feel free to write a note on the bottom of  if you want to get in touch with me. Again, welcome!   Tropics "'],
      dtype=object), array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0]]))

We also check the shapes. We expect features of shape (batch,) and targets of shape (batch, number of labels).

Code
print(
  f"text train shape: {train_ds.as_numpy_iterator().next()[0].shape}\n",
  f" text train type: {train_ds.as_numpy_iterator().next()[0].dtype}\n",
  f"label train shape: {train_ds.as_numpy_iterator().next()[1].shape}\n",
  f"label train type: {train_ds.as_numpy_iterator().next()[1].dtype}\n"
  )
text train shape: (32,)
  text train type: object
 label train shape: (32, 6)
 label train type: int64

4 Preprocessing

Of course, preprocessing! Raw text is not the kind of input a neural network can handle. The TextVectorization layer is designed for natural language inputs. Processing each example involves the following steps:

1. Standardize each example (usually lowercasing + punctuation stripping).
2. Split each example into substrings (usually words).
3. Recombine substrings into tokens (usually ngrams).
4. Index tokens (associate a unique integer value with each token).
5. Transform each example using this index, either into a vector of integers or a dense float vector.

For more details, see the TextVectorization documentation.

Code
text_vectorization = TextVectorization(
  max_tokens=config.max_tokens,
  standardize="lower_and_strip_punctuation",
  split="whitespace",
  output_mode="int",
  output_sequence_length=config.output_sequence_length,
  pad_to_max_tokens=True
  )

# prepare a dataset that only yields raw text inputs (no labels)
text_train_ds = train_ds.map(lambda x, y: x)
# adapt the text vectorization layer to the text data to index the dataset vocabulary
text_vectorization.adapt(text_train_ds)

This layer is configured as follows:

  • max_tokens: 20000, a common vocabulary size for text classification. It is the maximum size of the vocabulary for this layer.
  • output_sequence_length: 911. See Figure 3 for the reason why. Only valid in "int" mode.
  • output_mode: "int", which outputs one integer index per split string token. When output_mode == "int", 0 is reserved for masked locations; this reduces the effective vocabulary to max_tokens - 2 terms instead of max_tokens - 1.
  • standardize: "lower_and_strip_punctuation".
  • split: on whitespace.
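The five steps above can be sketched in plain Python. This is a toy re-implementation for illustration only, not the actual Keras layer:

```python
import re

def standardize(text):
    # step 1: lowercase and strip punctuation
    return re.sub(r"[^\w\s]", "", text.lower())

def build_vocab(corpus, max_tokens=20000):
    # steps 2-4: split on whitespace, index tokens by frequency.
    # index 0 is reserved for padding/mask, index 1 for out-of-vocabulary,
    # so the vocabulary holds at most max_tokens - 2 real terms.
    counts = {}
    for doc in corpus:
        for tok in standardize(doc).split():
            counts[tok] = counts.get(tok, 0) + 1
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {tok: i + 2 for i, tok in enumerate(ordered[: max_tokens - 2])}

def vectorize(text, vocab, output_sequence_length=8):
    # step 5: map tokens to integers, pad/truncate to a fixed length
    ids = [vocab.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:output_sequence_length]
    return ids + [0] * (output_sequence_length - len(ids))

corpus = ["You are welcome!", "Welcome to Wikipedia."]
vocab = build_vocab(corpus)
print(vectorize("welcome, stranger", vocab))  # → [2, 1, 0, 0, 0, 0, 0, 0]
```

Unknown words ("stranger") map to the OOV index 1, and the sequence is padded with 0 up to the fixed output length, exactly the behavior the real layer gives in "int" mode.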

To preserve the original comments as text and also have a tf.data.Dataset in which the text is preprocessed by the TextVectorization layer, it is possible to map the layer over the features of each dataset.

Code
processed_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=tf.data.experimental.AUTOTUNE
)
processed_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=tf.data.experimental.AUTOTUNE
)
processed_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=tf.data.experimental.AUTOTUNE
)

5 Model

5.1 Definition

Define the model using the Functional API.

Code
from keras.layers import LayerNormalization  # used below, missing from the imports above

def get_deeper_lstm_model():
    clear_session()
    inputs = Input(shape=(None,), dtype=tf.int64, name="inputs")
    embedding = Embedding(
        input_dim=config.max_tokens,
        output_dim=config.embedding_dim,
        mask_zero=True,
        name="embedding"
    )(inputs)
    x = Bidirectional(LSTM(256, return_sequences=True, name="bilstm_1"))(embedding)
    x = Bidirectional(LSTM(128, return_sequences=True, name="bilstm_2"))(x)
    # Global average pooling
    x = GlobalAveragePooling1D()(x)
    # Add regularization
    x = Dropout(0.3)(x)
    x = Dense(64, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.01))(x)
    x = LayerNormalization()(x)
    outputs = Dense(len(config.labels), activation='sigmoid', name="outputs")(x)
    model = Model(inputs, outputs)
    model.compile(optimizer='adam', loss="binary_crossentropy", metrics=config.metrics, steps_per_execution=32)
    
    return model

lstm_model = get_deeper_lstm_model()
lstm_model.summary()

5.2 Callbacks

Finally, the model has been trained using 2 callbacks:

  • Early Stopping, to avoid consuming the Kaggle GPU time.
  • Model Checkpoint, to retrieve the best model's training information.

Code
# callbacks
my_es = config.get_early_stopping()
my_mc = config.get_model_checkpoint(filepath="/checkpoint.keras")
callbacks = [my_es, my_mc]
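The config helpers are defined elsewhere in the project; a minimal sketch of what they might return (hypothetical settings, not the actual config) is:

```python
from keras.callbacks import EarlyStopping, ModelCheckpoint

def get_early_stopping():
    # stop when validation loss stops improving; keep the best weights
    # (patience=3 is an assumed value, not taken from the notebook's config)
    return EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)

def get_model_checkpoint(filepath):
    # save only the checkpoint with the best validation loss so far
    return ModelCheckpoint(filepath=filepath, monitor="val_loss", save_best_only=True)

callbacks = [get_early_stopping(), get_model_checkpoint("/checkpoint.keras")]
```

Together these mean training stops early once validation loss plateaus, while the checkpoint on disk always holds the best epoch rather than the last one.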

5.3 Final preparation before fit

Considering the dataset is imbalanced, to improve performance we need to calculate class weights. These will be passed during the training of the model.

Code
lab = pd.DataFrame(columns=config.labels, data=ytrain)
r = lab.sum() / len(ytrain)
class_weight = dict(zip(range(len(config.labels)), r))
df_class_weight = pd.DataFrame.from_dict(
  data=class_weight,
  orient='index',
  columns=['class_weight']
  )
df_class_weight.index = config.labels
Code
library(reticulate)
py$df_class_weight
Table 4: Class weight
              class_weight
toxic          0.095900590
severe_toxic   0.009928468
obscene        0.052757858
threat         0.003061800
insult         0.049132042
identity_hate  0.008710911
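As a toy check (plain Python with made-up label counts, not the notebook's data), the weight used here is simply the prevalence of each label: column sums divided by the number of samples.

```python
# hypothetical label matrix: rows are comments, columns are
# (toxic, severe_toxic, obscene, threat, insult, identity_hate)
ytrain = [
    [1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

n = len(ytrain)
# per-label prevalence = column sum / number of samples
freq = [sum(row[j] for row in ytrain) / n for j in range(len(labels))]
class_weight = dict(enumerate(freq))
print(class_weight)  # → {0: 0.5, 1: 0.0, 2: 0.25, 3: 0.0, 4: 0.25, 5: 0.0}
```

Rare labels such as threat therefore get very small weights, mirroring the values in Table 4.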

It is also useful to define the steps per epoch for the train and validation datasets. This step is required to avoid running out of data during the fit, which happened to me.

Code
steps_per_epoch = config.train_samples // config.batch_size
validation_steps = config.val_samples // config.batch_size

5.4 Fit

The fit has been done on Kaggle to leverage the GPU. Some considerations about the fit:

  • .repeat() ensures the model can keep consuming the dataset across epochs.
  • epochs is set to 100.
  • validation_data is repeated in the same way.
  • callbacks are the ones defined before.
  • class_weight ensures the model is trained using the frequency of each class, because our dataset is imbalanced.
  • steps_per_epoch and validation_steps are required because of .repeat().
Code
history = model.fit(
  processed_train_ds.repeat(),
  epochs=config.epochs,
  validation_data=processed_val_ds.repeat(),
  callbacks=callbacks,
  class_weight=class_weight,
  steps_per_epoch=steps_per_epoch,
  validation_steps=validation_steps
  )

Now we can import the model and the history trained on Kaggle.

Code
model = load_model(filepath=config.model)
history = pd.read_excel(config.history)

5.5 Evaluate

Code

validation = model.evaluate(
  processed_val_ds.repeat(),
  steps=validation_steps, # 748
  verbose=0
  )
Code
val_metrics <- tibble(
  metric = c("loss", "precision", "recall", "auc", "f1_score"),
  value = py$validation
  )
val_metrics
Table 5: Model validation metric
# A tibble: 5 × 2
  metric     value
  <chr>      <dbl>
1 loss      0.0542
2 precision 0.789 
3 recall    0.671 
4 auc       0.957 
5 f1_score  0.0293

5.6 Predict

For the prediction, the model does not need the dataset to repeat: it has already been trained on all of the training data, and now it just has to consume each new example once to make its prediction.

Code

predictions = model.predict(processed_test_ds, verbose=0)

5.7 Confusion Matrix

The best way to assess the performance of a multilabel classifier is a confusion matrix. Scikit-learn provides a specific function, multilabel_confusion_matrix, which builds one binary confusion matrix per label to handle the fact that a single prediction can carry multiple labels.
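Per label, this is just the usual 2x2 binary confusion matrix. A toy re-computation in plain Python (illustrative, not the sklearn implementation):

```python
def binary_confusion(y_true, y_pred):
    # returns [[TN, FP], [FN, TP]], the layout sklearn uses per label
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return [[tn, fp], [fn, tp]]

# one matrix per label: here a made-up "toxic" column for 5 comments
y_true_toxic = [1, 0, 1, 0, 0]
y_pred_toxic = [1, 0, 0, 1, 0]
print(binary_confusion(y_true_toxic, y_pred_toxic))  # → [[2, 1], [1, 1]]
```

multilabel_confusion_matrix simply stacks one such matrix for each of the 6 labels.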

5.7.1 Grid Search Cross Validation for best threshold

Grid Search CV is a technique for fine-tuning the hyperparameters of a ML model. It systematically searches through a set of hyperparameter values to find the combination that yields the best model performance. In this case, I pair it with KFold Cross Validation, a resampling technique that splits the data into k consecutive folds: each fold is used once for validation while the remaining k - 1 folds form the training set. See the documentation for more information.

The model is trained to optimize recall. This decision was made because the cost of missing a True Positive is greater than that of a False Positive: missing an injurious comment is worse than classifying a clean one as bad.
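config.find_optimal_threshold_cv is defined elsewhere in the project; a minimal sketch of what such a helper might do, assuming a metric with signature metric(y_true, y_pred), is the following hypothetical reimplementation:

```python
import numpy as np

def find_optimal_threshold_cv(y_true, y_pred_proba, metric, n_splits=5,
                              thresholds=np.arange(0.05, 0.95, 0.05)):
    """Scan candidate thresholds, scoring each by the mean metric over k folds."""
    y_true = np.asarray(y_true)
    y_pred_proba = np.asarray(y_pred_proba)
    folds = np.array_split(np.arange(len(y_true)), n_splits)
    best_threshold, best_score = None, -np.inf
    for t in thresholds:
        # binarize the probabilities at this threshold, score fold by fold
        fold_scores = [metric(y_true[idx], (y_pred_proba[idx] >= t).astype(int))
                       for idx in folds]
        mean_score = float(np.mean(fold_scores))
        if mean_score > best_score:
            best_threshold, best_score = float(t), mean_score
    return best_threshold, best_score
```

With sklearn's f1_score or recall_score one would pass the metric pre-configured for multilabel averaging, e.g. via a small wrapper function.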

5.7.2 Confidence threshold and Precision-Recall trade off

Whilst the KFold Grid Search CV technique is useful for testing multiple hyperparameters, it is important to understand the problem we are facing. A multilabel deep learning classifier outputs a vector of per-class probabilities. These need to be converted into a binary vector using a confidence threshold.

  • The higher the threshold, the fewer classes the model predicts, increasing model confidence [higher Precision] but also increasing missed classes [lower Recall].
  • The lower the threshold, the more classes the model predicts, decreasing model confidence [lower Precision] but also decreasing missed classes [higher Recall].
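A quick numerical illustration of this trade-off, using toy probabilities rather than the model's outputs:

```python
def precision_recall(y_true, y_pred):
    # precision = TP / predicted positives, recall = TP / actual positives
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    predicted = sum(y_pred)
    actual = sum(y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0]
proba  = [0.9, 0.6, 0.3, 0.55, 0.2, 0.1]

for threshold in (0.25, 0.5, 0.75):
    y_pred = [int(p >= threshold) for p in proba]
    print(threshold, precision_recall(y_true, y_pred))
```

Raising the threshold from 0.25 to 0.75 moves precision from 0.75 up to 1.0 while recall drops from 1.0 to 1/3, exactly the tension described above.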

Threshold selection means we have to decide which metric to prioritize, based on the problem we are facing and the relative cost of misjudging. We can consider toxic comment filtering a problem similar to cancer diagnosis: it is better to predict cancer in people who do not have it [False Positive] and perform further analysis than to miss the disease in a patient who has it [False Negative].

I decided to train the model on the F1 score to obtain a model balanced between precision and recall, and to leave it to threshold selection to boost recall.

Moreover, the model has been trained on the macro average F1 score, a single performance indicator obtained by averaging the F1 scores of the individual classes.

\[ F1\ macro\ avg = \frac{\sum_{i=1}^{n} F1_i}{n} \]

It is useful with imbalanced classes because it weights each class equally: it is not influenced by the number of samples in each class. This is set both in config.metrics and in find_optimal_threshold_cv.
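A worked toy example of the formula above, with made-up per-class precision/recall pairs:

```python
def f1(precision, recall):
    # per-class F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# hypothetical (precision, recall) pairs for three labels
per_class = [(0.9, 0.8), (0.5, 0.4), (0.2, 0.6)]

f1_scores = [f1(p, r) for p, r in per_class]
# macro average: plain mean of the per-class F1 scores
macro_f1 = sum(f1_scores) / len(f1_scores)
print(round(macro_f1, 4))  # → 0.5305
```

Note that the weakest class (F1 = 0.3) pulls the macro average down just as much as the strongest one pulls it up, which is exactly why the metric does not hide poor performance on rare labels.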

5.7.2.1 f1_score

Code
ytrue = ytest.astype(int)
y_pred_proba = predictions
optimal_threshold_f1, best_score_f1 = config.find_optimal_threshold_cv(ytrue, y_pred_proba, f1_score)

print(f"Optimal threshold: {optimal_threshold_f1}")
Optimal threshold: 0.15000000000000002
Code
print(f"Best score: {best_score_f1}")
Best score: 0.4788653077945807
Code

# Use the optimal threshold to make predictions
final_predictions_f1 = (y_pred_proba >= optimal_threshold_f1).astype(int)

Optimal threshold f1 score: 0.15. Best score: 0.4788653.

5.7.2.2 recall_score

Code
ytrue = ytest.astype(int)
y_pred_proba = predictions
optimal_threshold_recall, best_score_recall = config.find_optimal_threshold_cv(ytrue, y_pred_proba, recall_score)

# Use the optimal threshold to make predictions
final_predictions_recall = (y_pred_proba >= optimal_threshold_recall).astype(int)

Optimal threshold recall: 0.05. Best score: 0.8095814.

5.7.2.3 roc_auc_score

Code
ytrue = ytest.astype(int)
y_pred_proba = predictions
optimal_threshold_roc, best_score_roc = config.find_optimal_threshold_cv(ytrue, y_pred_proba, roc_auc_score)

print(f"Optimal threshold: {optimal_threshold_roc}")
Optimal threshold: 0.05
Code
print(f"Best score: {best_score_roc}")
Best score: 0.8809499649742268
Code

# Use the optimal threshold to make predictions
final_predictions_roc = (y_pred_proba >= optimal_threshold_roc).astype(int)

Optimal threshold roc: 0.05. Best score: 0.88095.

5.7.3 Confusion Matrix Plot

Code
# convert probability predictions to predictions
ypred = predictions >=  optimal_threshold_recall # .05
ypred = ypred.astype(int)

# create a plot with 3 by 2 subplots
fig, axes = plt.subplots(3, 2, figsize=(15, 15))
axes = axes.flatten()
mcm = multilabel_confusion_matrix(ytrue, ypred)
# plot the confusion matrices for each label
for i, (cm, label) in enumerate(zip(mcm, config.labels)):
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot(ax=axes[i], colorbar=False)
    axes[i].set_title(f"Confusion matrix for label: {label}")
plt.tight_layout()
plt.show()
Figure 4: Multi Label Confusion matrix

5.8 Classification Report

Code

cr = classification_report(
  ytrue,
  ypred,
  target_names=config.labels,
  digits=4,
  output_dict=True
  )
df_cr = pd.DataFrame.from_dict(cr).reset_index()
Code
library(reticulate)
df_cr <- py$df_cr %>% dplyr::rename(names = index)
cols <- df_cr %>% colnames()
df_cr %>% 
  pivot_longer(
    cols = -names,
    names_to = "metrics",
    values_to = "values"
  ) %>% 
  pivot_wider(
    names_from = names,
    values_from = values
  )
Table 6: Classification report
# A tibble: 10 × 5
   metrics       precision recall `f1-score` support
   <chr>             <dbl>  <dbl>      <dbl>   <dbl>
 1 toxic            0.552  0.890      0.682     2262
 2 severe_toxic     0.236  0.917      0.375      240
 3 obscene          0.550  0.936      0.692     1263
 4 threat           0.0366 0.493      0.0681      69
 5 insult           0.471  0.915      0.622     1170
 6 identity_hate    0.116  0.720      0.200      207
 7 micro avg        0.416  0.896      0.569     5211
 8 macro avg        0.327  0.812      0.440     5211
 9 weighted avg     0.495  0.896      0.629     5211
10 samples avg      0.0502 0.0848     0.0597    5211

6 Conclusions

The BiLSTM model, optimized for high recall, performs well enough to make predictions for each label. Considering the low support for the threat label, the performance is not bad: see Table 2 and Figure 1, where the threat label accounts for only 0.27% of the observations. The model has been optimized for recall because the cost of not identifying an injurious comment as such is higher than the cost of flagging a clean comment as injurious.

Possible improvements could be to increase the number of observations, especially for the threat label. In general there are too many clean comments. This could be addressed by undersampling the clean comments, which I explicitly avoided in order to check the performance of the BiLSTM on an imbalanced dataset, leveraging the class weight method.